{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Predicting NYC Taxi Fares with Intel Optimizations on Full Dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is a notebook originally written for Rapids but converted to use Modin on Omnisci."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "import glob\n",
    "import matplotlib.pyplot as plt\n",
    "import socket, time\n",
    "import modin.pandas as modin_omni_pd\n",
    "import xgboost as xgb\n",
    "\n",
    "# To install Holoviews and hvplot\n",
    "# conda install -c conda-forge holoviews\n",
    "# conda install -c pyviz hvplot\n",
    "import holoviews as hv\n",
    "from holoviews import opts\n",
    "import numpy as np\n",
    "import hvplot.pandas\n",
    "import hvplot.dask"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Inspecting the Data\n",
    "\n",
    "We'll use Modin on Omnisci to load and parse all CSV files into a DataFrame. It makes it 30 files overall."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Data Cleanup\n",
    "\n",
    "As usual, the data needs to be massaged a bit before we can start adding features that are useful to an ML model.\n",
    "\n",
    "For example, in the 2014 taxi CSV files, there are `pickup_datetime` and `dropoff_datetime` columns. The 2015 CSVs have `tpep_pickup_datetime` and `tpep_dropoff_datetime`, which are the same columns. One year has `rate_code`, and another `RateCodeID`.\n",
    "\n",
    "Also, some CSV files have column names with extraneous spaces in them.\n",
    "\n",
    "Worst of all, starting in the July 2016 CSVs, pickup & dropoff latitude and longitude data were replaced by location IDs, making the second half of the year useless to us.\n",
    "\n",
    "We'll do a little string manipulation, column renaming, and concatenating of DataFrames to sidestep the problems."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Dictionary of required columns and their datatypes\n",
    "must_haves = {\n",
    "    \"pickup_datetime\": \"datetime64[s]\",\n",
    "    \"dropoff_datetime\": \"datetime64[s]\",\n",
    "    \"passenger_count\": \"int32\",\n",
    "    \"trip_distance\": \"float32\",\n",
    "    \"pickup_longitude\": \"float32\",\n",
    "    \"pickup_latitude\": \"float32\",\n",
    "    \"rate_code\": \"int32\",\n",
    "    \"dropoff_longitude\": \"float32\",\n",
    "    \"dropoff_latitude\": \"float32\",\n",
    "    \"fare_amount\": \"float32\",\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def clean(ddf, must_haves):\n",
    "    # replace the extraneous spaces in column names and lower the font type\n",
    "    tmp = {col: col.strip().lower() for col in list(ddf.columns)}\n",
    "    ddf = ddf.rename(columns=tmp)\n",
    "\n",
    "    ddf = ddf.rename(\n",
    "        columns={\n",
    "            \"tpep_pickup_datetime\": \"pickup_datetime\",\n",
    "            \"tpep_dropoff_datetime\": \"dropoff_datetime\",\n",
    "            \"ratecodeid\": \"rate_code\",\n",
    "        }\n",
    "    )\n",
    "\n",
    "    for col in ddf.columns:\n",
    "        if col not in must_haves:\n",
    "            ddf = ddf.drop(columns=col)\n",
    "            continue\n",
    "        if ddf[col].dtype == \"object\":\n",
    "            ddf[col] = ddf[col].fillna(\"-1\")\n",
    "\n",
    "    return ddf"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "base_path = \"/yellow-taxi/\"\n",
    "\n",
    "df_2014 = modin_omni_pd.concat(\n",
    "    [\n",
    "        clean(\n",
    "            modin_omni_pd.read_csv(\n",
    "                x, parse_dates=[\" pickup_datetime\", \" dropoff_datetime\"]\n",
    "            ),\n",
    "            must_haves,\n",
    "        )\n",
    "        for x in glob.glob(base_path + \"2014/yellow_*.csv\")\n",
    "    ],\n",
    "    ignore_index=True,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_2014.dtypes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<b> NOTE: </b>We will realize that some of 2015 data has column name as `RateCodeID` and others have `RatecodeID`. When we rename the columns in the clean function, it internally doesn't pass meta while calling map_partitions(). This leads to the error of column name mismatch in the returned data. For this reason, we will call the clean function with map_partition and pass the meta to it. Here is the link to the bug created for that: https://github.com/rapidsai/cudf/issues/5413 "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_2014.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We still have 2015 and the first half of 2016's data to read and clean. Let's increase our dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_2015 = modin_omni_pd.concat(\n",
    "    [\n",
    "        clean(\n",
    "            modin_omni_pd.read_csv(\n",
    "                x, parse_dates=[\"tpep_pickup_datetime\", \"tpep_dropoff_datetime\"]\n",
    "            ),\n",
    "            must_haves,\n",
    "        )\n",
    "        for x in glob.glob(base_path + \"2015/yellow_*.csv\")\n",
    "    ],\n",
    "    ignore_index=True,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_2015.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Handling 2016's Mid-Year Schema Change"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In 2016, only January - June CSVs have the columns we need. If we try to read base_path+2016/yellow_*.csv, Dask will not appreciate having differing schemas in the same DataFrame.\n",
    "\n",
    "Instead, we'll need to create a list of the valid months and read them independently."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "months = [str(x).rjust(2, \"0\") for x in range(1, 7)]\n",
    "valid_files = [\n",
    "    base_path + \"2016/yellow_tripdata_2016-\" + month + \".csv\" for month in months\n",
    "]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# read & clean 2016 data and concat all DFs\n",
    "df_2016 = modin_omni_pd.concat(\n",
    "    [\n",
    "        clean(\n",
    "            modin_omni_pd.read_csv(\n",
    "                x, parse_dates=[\"tpep_pickup_datetime\", \"tpep_dropoff_datetime\"]\n",
    "            ),\n",
    "            must_haves,\n",
    "        )\n",
    "        for x in valid_files\n",
    "    ],\n",
    "    ignore_index=True,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# concatenate multiple DataFrames into one bigger one\n",
    "taxi_df = modin_omni_pd.concat([df_2014, df_2015, df_2016], ignore_index=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# taxi_df = taxi_df.persist()\n",
    "taxi_df.dtypes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "taxi_df.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exploratory Data Analysis (EDA)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here, we are checking out if there are any non-sensical records and outliers, and in such case, we need to remove them from the dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# check out if there is any negative total trip time\n",
    "taxi_df[taxi_df.dropoff_datetime <= taxi_df.pickup_datetime].head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# check out if there is any abnormal data where trip distance is short, but the fare is very high.\n",
    "taxi_df[(taxi_df.trip_distance < 10) & (taxi_df.fare_amount > 300)].head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# check out if there is any abnormal data where trip distance is long, but the fare is very low.\n",
    "taxi_df[(taxi_df.trip_distance > 50) & (taxi_df.fare_amount < 50)].head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Using only 2016-01 data for visuals.\n",
    "# taxi_df_cdf = clean(cudf.read_csv(valid_files[0]),must_haves)\n",
    "\n",
    "# Using entire 2016 data for visualization\n",
    "taxi_df_cdf = taxi_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The plot below visualizes the histogram of trip_distance and we can see some abnormal trip_distance values for some records. Taking this and also the NYC map coordinates into consideration, we will only select records where tripdistance < 500 miles."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "# Histogram using cupy and Holoviews\n",
    "# frequencies, edges = cupy.histogram(x=cupy.array(taxi_df_cdf[\"trip_distance\"]) , bins=20)\n",
    "# hist = hv.Histogram((np.array(edges.tolist()), np.array(frequencies.tolist())))\n",
    "\n",
    "# Histogram using hvplot\n",
    "hist = taxi_df_cdf._to_pandas().hvplot.hist(\"trip_distance\", bins=20, bin_range=(0, 10))\n",
    "\n",
    "# Customizing the plot\n",
    "hist.opts(\n",
    "    xlabel=\"trip distance (miles)\", ylabel=\"count\", color=\"green\", width=900, height=400\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Similarly, the plot below visualizes the histogram of fare_amount and we can see some abnormal fare_amount values for some records. Taking this and also the NYC map coordinates into consideration, we will only select records where fare_amount < 500$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "# Histogram using cupy and Holoviews\n",
    "# frequencies, edges = cupy.histogram(x=cupy.array(taxi_df_cdf[\"fare_amount\"]) , bins=20)\n",
    "# hist = hv.Histogram((np.array(edges.tolist()), np.array(frequencies.tolist())))\n",
    "\n",
    "# Histogram using hvplot\n",
    "hist = taxi_df_cdf._to_pandas().hvplot.hist(\"fare_amount\", bins=20, bin_range=(0, 50))\n",
    "\n",
    "# Customizing the plot\n",
    "hist.opts(xlabel=\"fare amount ($)\", ylabel=\"count\", color=\"green\", width=900, height=400)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "# Plot the number of passengers per trip. We'll remove the records where passenger_count > 5.\n",
    "# Plotting using Holoviews\n",
    "# bar = hv.Bars(taxi_df_cdf.groupby(\"passenger_count\").size().to_frame().rename(columns={0:\"count\"}))\n",
    "\n",
    "# Plotting using hvplot\n",
    "df_bar = (\n",
    "    taxi_df_cdf.groupby(\"passenger_count\")\n",
    "    .size()\n",
    "    .to_frame()\n",
    "    .rename(columns={0: \"count\"})\n",
    "    .reset_index()\n",
    ")\n",
    "bar = df_bar._to_pandas().hvplot.bar(x=\"passenger_count\", y=\"count\")\n",
    "\n",
    "# Customizing the plot\n",
    "bar.opts(color=\"green\", width=900, height=400)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "EDA visuals and additional analysis yield the filter logic below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "taxi_df.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# apply a list of filter conditions to throw out records with missing or outlier values\n",
    "taxi_df = taxi_df[\n",
    "    (taxi_df.fare_amount > 1)\n",
    "    & (taxi_df.fare_amount < 500)\n",
    "    & (taxi_df.passenger_count > 0)\n",
    "    & (taxi_df.passenger_count < 6)\n",
    "    & (taxi_df.pickup_longitude > -75)\n",
    "    & (taxi_df.pickup_longitude < -73)\n",
    "    & (taxi_df.dropoff_longitude > -75)\n",
    "    & (taxi_df.dropoff_longitude < -73)\n",
    "    & (taxi_df.pickup_latitude > 40)\n",
    "    & (taxi_df.pickup_latitude < 42)\n",
    "    & (taxi_df.dropoff_latitude > 40)\n",
    "    & (taxi_df.dropoff_latitude < 42)\n",
    "    & (taxi_df.trip_distance > 0)\n",
    "    & (taxi_df.trip_distance < 500)\n",
    "    & ((taxi_df.trip_distance <= 50) | (taxi_df.fare_amount >= 50))\n",
    "    & ((taxi_df.trip_distance >= 10) | (taxi_df.fare_amount <= 300))\n",
    "    & (taxi_df.dropoff_datetime > taxi_df.pickup_datetime)\n",
    "]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# reset_index and drop index column\n",
    "taxi_df = taxi_df.reset_index(drop=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Adding Interesting Features\n",
    "\n",
    "We'll use a Euclidean Distance calculation to find total trip distance, and extract additional useful variables from the datetime fields."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "## add features\n",
    "taxi_df[\"day\"] = taxi_df[\"pickup_datetime\"].dt.day\n",
    "\n",
    "# calculate the time difference between dropoff and pickup.\n",
    "taxi_df[\"diff\"] = taxi_df[\"dropoff_datetime\"].astype(\"int64\") - taxi_df[\n",
    "    \"pickup_datetime\"\n",
    "].astype(\"int64\")\n",
    "\n",
    "taxi_df[\"pickup_latitude_r\"] = taxi_df[\"pickup_latitude\"] // 0.01 * 0.01\n",
    "taxi_df[\"pickup_longitude_r\"] = taxi_df[\"pickup_longitude\"] // 0.01 * 0.01\n",
    "taxi_df[\"dropoff_latitude_r\"] = taxi_df[\"dropoff_latitude\"] // 0.01 * 0.01\n",
    "taxi_df[\"dropoff_longitude_r\"] = taxi_df[\"dropoff_longitude\"] // 0.01 * 0.01\n",
    "\n",
    "taxi_df = taxi_df.drop(\"pickup_datetime\", axis=1)\n",
    "taxi_df = taxi_df.drop(\"dropoff_datetime\", axis=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "taxi_df.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dlon = taxi_df[\"dropoff_longitude\"] - taxi_df[\"pickup_longitude\"]\n",
    "dlat = taxi_df[\"dropoff_latitude\"] - taxi_df[\"pickup_latitude\"]\n",
    "taxi_df[\"e_distance\"] = dlon * dlon + dlat * dlat"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "taxi_df.dtypes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "taxi_df.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Pick a Training Set\n",
    "\n",
    "Let's imagine you're making a trip to New York on the 25th and want to build a model to predict what fare prices will be like the last few days of the month based on the first part of the month. We'll use a query expression to identify the `day` of the month to use to divide the data into train and test sets."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# since we calculated the h_distance let's drop the trip_distance column, and then do model training with XGB.\n",
    "taxi_df = taxi_df.drop(\"trip_distance\", axis=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# this is the original data partition for train and test sets.\n",
    "X_train = taxi_df[taxi_df.day < 25]\n",
    "\n",
    "# create a Y_train ddf with just the target variable\n",
    "Y_train = X_train[[\"fare_amount\"]]\n",
    "# drop the target variable from the training ddf\n",
    "X_train = X_train[X_train.columns.difference([\"fare_amount\"])]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Train the XGBoost Regression Model\n",
    "\n",
    "The wall time output below indicates how long it took to train an XGBoost model over the training set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_train.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_train.dtypes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "Y_train.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "dtrain = xgb.DMatrix(X_train, Y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Time to Train our XGBoost Model!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "\n",
    "trained_model = xgb.train(\n",
    "    {\n",
    "        \"learning_rate\": 0.3,\n",
    "        \"max_depth\": 8,\n",
    "        \"objective\": \"reg:squarederror\",\n",
    "        \"subsample\": 0.6,\n",
    "        \"gamma\": 1,\n",
    "        \"silent\": True,\n",
    "        \"verbose_eval\": True,\n",
    "        \"tree_method\": \"hist\",\n",
    "    },\n",
    "    dtrain,\n",
    "    num_boost_round=100,\n",
    "    evals=[(dtrain, \"train\")],\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ax = xgb.plot_importance(\n",
    "    trained_model, height=0.8, max_num_features=10, importance_type=\"gain\"\n",
    ")\n",
    "ax.grid(False, axis=\"y\")\n",
    "ax.set_title(\"Estimated feature importance\")\n",
    "ax.set_xlabel(\"importance\")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# How Good is Our Model?\n",
    "\n",
    "Now that we have a trained model, we need to test it with the 25% of records we held out."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_test = taxi_df[taxi_df.day >= 25]\n",
    "\n",
    "# Create Y_test with just the fare amount\n",
    "Y_test = X_test[[\"fare_amount\"]]\n",
    "\n",
    "# Drop the fare amount from X_test\n",
    "X_test = X_test[X_test.columns.difference([\"fare_amount\"])]\n",
    "\n",
    "# display test set size\n",
    "X_test.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Calculate Prediction"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# generate predictions on the test set\n",
    "booster = trained_model\n",
    "prediction = modin_omni_pd.Series(booster.predict(xgb.DMatrix(X_test)))\n",
    "prediction.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# prediction = prediction.map_partitions(lambda part: cudf.Series(part)).reset_index(drop=True)\n",
    "actual = Y_test[\"fare_amount\"].reset_index(drop=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "prediction.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "actual.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Calculate RMSE\n",
    "squared_error = (prediction - actual) ** 2\n",
    "\n",
    "# compute the actual RMSE over the full test set\n",
    "np.sqrt(squared_error.mean())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Scikit-Learn Part"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Normalize data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.preprocessing import MinMaxScaler, StandardScaler\n",
    "\n",
    "scaler_x = MinMaxScaler()\n",
    "scaler_y = StandardScaler()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "scaler_x.fit(X_train)\n",
    "X_train = scaler_x.transform(X_train)\n",
    "X_test = scaler_x.transform(X_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "scaler_y.fit(Y_train.to_numpy().reshape(-1, 1))\n",
    "Y_train = scaler_y.transform(Y_train.to_numpy().reshape(-1, 1)).ravel()\n",
    "Y_test = scaler_y.transform(Y_test.to_numpy().reshape(-1, 1)).ravel()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Train Ridge regression model without  Extension for Scikit-learn*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearnex import unpatch_sklearn\n",
    "\n",
    "unpatch_sklearn()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.linear_model import Ridge"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "model_r_s = Ridge(random_state=123)\n",
    "model_r_s.fit(X_train, Y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Train Ridge regression model with  Extension for Scikit-learn*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearnex import patch_sklearn\n",
    "\n",
    "patch_sklearn()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.linear_model import Ridge\n",
    "from sklearn.metrics import mean_squared_error"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "model_r_p = Ridge(random_state=123)\n",
    "model_r_p.fit(X_train, Y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Let's look RMSE metric of Our Models"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "y_pred_p = model_r_p.predict(X_test)\n",
    "y_pred_s = model_r_s.predict(X_test)\n",
    "\n",
    "mse_r_p = mean_squared_error(Y_test, y_pred_p)\n",
    "mse_r_s = mean_squared_error(Y_test, y_pred_s)\n",
    "\n",
    "print(f\"RMSE of patched Ridge: {np.sqrt(mse_r_p)}\")\n",
    "print(f\"RMSE of unpatched Ridge: {np.sqrt(mse_r_s)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we can see the model with Extension for Scikit-learn * learns much faster and has the same rmse."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.7"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
